
Feat/optimize html streaming #351

Draft
prk-Jr wants to merge 16 commits into main from feat/optimize-html-streaming

Conversation


@prk-Jr commented Feb 20, 2026

Summary

This PR combines the core publisher-proxy streaming optimization with the
Next.js RSC follow-up work.

At the platform level, Trusted Server moves from a fully buffered proxy model
to chunked streaming using Fastly stream_to_client(), enabling early header
flush and incremental HTML delivery to reduce TTFB and improve subresource
discovery.

On top of that foundation, the HTML pipeline now supports RSC-aware lazy
accumulation: non-RSC content continues to stream immediately, while only RSC
content that requires post-processing is buffered and rewritten safely. This
preserves correctness for fragmented/cross-script RSC payloads while restoring
meaningful streaming behavior.


Key Changes

  • stream_to_client() Integration (publisher.rs)
    Replaced fully buffered response collection with stream_to_client() to
    enable immediate header dispatch and incremental chunk streaming.

  • lol_html Output Pipeline (streaming_processor.rs)
    Refactored the HtmlRewriter adapter to implement the OutputSink trait
    with a shared Rc<RefCell<Vec<u8>>>, enabling true incremental streaming.

  • Buffer Pre-allocation
    Replaced std::mem::take with Vec::with_capacity and
    std::mem::replace to reduce reallocation churn during chunk processing.

  • WASM Hostcall Batching
    Wrapped StreamingBody output in an 8KB std::io::BufWriter to reduce
    WASM-to-host boundary crossings.

  • RSC Lazy Accumulation (html_processor.rs)
    Added conditional accumulation mode that starts buffering only when
    post-processing is required (for example, RSC placeholders or fragmented
    scripts). Non-RSC pages continue streaming instead of being fully buffered.

  • RSC Post-processing Triggers (nextjs integration)
    Added needs_accumulation support to integration post-processors and
    needs_post_processing detection in placeholder state, including fragmented
    script tracking for fallback re-parse correctness.

  • Memory Safety Guardrail
    Added a 10MB cap for accumulated post-processed HTML to avoid unbounded
    memory growth on large/malicious documents.

  • Routing and Header Consistency (fastly/src/main.rs, publisher.rs)
    Centralized route classification and standardized response-header application
    across buffered and streaming paths.

  • RSC Fixture/Test Expansion
    Added fixture-driven Next.js integration tests (including real Next.js output)
    plus a dedicated example app and scripts for fixture capture and live streaming
    validation.

  • Code Health
    Resolved associated Clippy warnings and added missing # Errors
    documentation in streaming-related handlers.
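The lazy-accumulation flow described above can be sketched with std types only. This is illustrative, not the actual `html_processor.rs` code: `ProcessorSketch`, `needs_post_processing`, and `post_process` are hypothetical names, the `self.__next_f` trigger is a stand-in for the real RSC placeholder/fragmented-script detection, and the shared `Rc<RefCell<Vec<u8>>>` sink mirrors the lol_html `OutputSink` adapter without depending on the crate:

```rust
use std::cell::RefCell;
use std::rc::Rc;

const MAX_ACCUMULATED: usize = 10 * 1024 * 1024; // 10MB guardrail from this PR

/// Shared output sink, mirroring the Rc<RefCell<Vec<u8>>> adapter used with
/// lol_html's OutputSink trait (illustrative, not the real type).
type SharedSink = Rc<RefCell<Vec<u8>>>;

struct ProcessorSketch {
    sink: SharedSink,
    accumulating: bool,
    buffer: Vec<u8>,
}

impl ProcessorSketch {
    fn new(sink: SharedSink) -> Self {
        Self { sink, accumulating: false, buffer: Vec::new() }
    }

    /// Stream non-RSC chunks straight through; start buffering only once a
    /// chunk is detected as needing post-processing.
    fn process_chunk(&mut self, chunk: &[u8]) -> Result<(), String> {
        if !self.accumulating && needs_post_processing(chunk) {
            self.accumulating = true;
            // Pre-allocate instead of growing from empty, echoing the
            // Vec::with_capacity / std::mem::replace change.
            self.buffer = Vec::with_capacity(64 * 1024);
        }
        if self.accumulating {
            if self.buffer.len() + chunk.len() > MAX_ACCUMULATED {
                return Err("accumulated HTML exceeds 10MB cap".into());
            }
            self.buffer.extend_from_slice(chunk);
        } else {
            self.sink.borrow_mut().extend_from_slice(chunk);
        }
        Ok(())
    }

    /// At end of stream, rewrite and flush whatever was accumulated.
    fn finish(mut self) -> Result<(), String> {
        let buffered = std::mem::take(&mut self.buffer);
        let rewritten = post_process(&buffered);
        self.sink.borrow_mut().extend_from_slice(&rewritten);
        Ok(())
    }
}

fn needs_post_processing(chunk: &[u8]) -> bool {
    // Illustrative trigger only; the real detection tracks RSC placeholders
    // and fragmented <script> payloads across chunk boundaries.
    chunk.windows(b"self.__next_f".len()).any(|w| w == b"self.__next_f")
}

fn post_process(html: &[u8]) -> Vec<u8> {
    html.to_vec() // no-op stand-in for RSC Flight URL rewriting
}
```

A page that never trips the trigger never sets `accumulating`, so every chunk reaches the client sink immediately, which is why non-RSC pages keep streaming.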


Test Plan

  • Local Unit & Workspace Tests
    Run:

    cargo test --workspace
  • TypeScript Bundle Build
    Run:

    npm run build

    in crates/js/lib to verify successful generation of integration modules.

  • Next.js RSC Integration Tests
    Run:

    cargo test --test nextjs_integration -- --nocapture

    to validate URL rewriting correctness and streaming behavior across fixture
    sets/chunk sizes.

  • Local Fastly Simulation
    Run:

    fastly compute serve

    Verify:

    • Headers are correctly injected on streamed responses
    • Proxy behavior remains correct
    • Baseline TTFB improvements (for example, via curl)
  • Staging Load Testing
    Execute:

    ./scripts/benchmark.sh

    against staging to quantify external TTFB and Time-to-Last-Byte (TTLB)
    improvements under concurrent traffic.
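For the curl-based TTFB spot check mentioned above, curl's `-w` timing variables report TTFB (`time_starttransfer`) and TTLB (`time_total`) directly. The URL below is a placeholder, not the actual staging hostname:

```shell
#!/bin/sh
# Placeholder endpoint; substitute the real staging or production hostname.
URL="https://staging.example.com/"

# -s silences the progress meter, -o /dev/null discards the body, and -w
# prints the timing variables curl recorded for the transfer.
curl -s -o /dev/null \
  -w 'TTFB: %{time_starttransfer}s  TTLB: %{time_total}s\n' \
  "$URL"
```

Running this a handful of times against both branches gives a quick median before reaching for the full `./scripts/benchmark.sh` run.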

Closes

Closes #320

prk-Jr and others added 11 commits February 18, 2026 21:33
Introduce RequestTimer for per-request phase tracking (init, backend,
process, total) exposed via Server-Timing response headers. Add
benchmark tooling with --profile mode for collecting timing data.
Document phased optimization plan covering streaming architecture,
code-level fixes, and open design questions for team review.
RequestTimer and Server-Timing header were premature — WASM guest
profiling via profile.sh gives better per-function visibility without
runtime overhead. Also strips dead --profile mode from benchmark.sh.
build.rs already resolves trusted-server.toml + env vars at compile time
and embeds the result. Replace Settings::from_toml() with direct
toml::from_str() to skip the config crate pipeline on every request.
Profiling confirms: ~5-8% → ~3.3% CPU per request.
- OPTIMIZATION.md: profiling results, CPU breakdown, phased optimization
  plan covering streaming fixes, config crate elimination, and
  stream_to_client() architecture
- scripts/profile.sh: WASM guest profiling via --profile-guest with
  Firefox Profiler-compatible output
- scripts/benchmark.sh: TTFB analysis, cold start detection, endpoint
  latency breakdown, and load testing with save/compare support
…ding HTML and RSC Flight URL rewriting, to avoid full-body buffering
@prk-Jr self-assigned this Feb 20, 2026

@prk-Jr commented Feb 23, 2026

Performance Benchmark: HTML Streaming Optimization

We ran a comprehensive apples-to-apples benchmark to measure the impact of the feat/optimize-html-streaming branch (which introduces lol_html for streaming <body> transformations instead of buffering).

To ensure statistical accuracy:

  • We increased the sample size to 50 requests.
  • We ran a 10-request deep warmup to eliminate cold-start WASM instantiations.
  • We tested both branches on Staging and Production directly to isolate environment-specific effects.

🚀 The Results: Production

This is the true impact on live users hitting the Fastly Edge:

| Metric | Baseline (main) | Optimization (feat/...) | Net Impact |
| --- | --- | --- | --- |
| First Byte (Median TTFB) | 160.09 ms | 144.75 ms | 🟢 15.34 ms Faster |
| First Byte (p95 Tail) | 252.78 ms | 228.93 ms | 🟢 23.85 ms Faster |
| Total Transfer Time | 315.19 ms | 345.20 ms | 🔶 30.01 ms Slower |

📉 The Results: Staging

(Note: Total times are higher here because Staging serves a 190KB uncompressed JS bundle, whereas Prod serves a minified 28KB bundle).

| Metric | Baseline (main) | Optimization (feat/...) | Net Impact |
| --- | --- | --- | --- |
| First Byte (Median TTFB) | 220.44 ms | 217.38 ms | 🟢 3.06 ms Faster |
| First Byte (p95 Tail) | 478.48 ms | 277.93 ms | 🟢 200.55 ms Faster |
| Total Transfer Time | 505.28 ms | 654.58 ms | 🔶 149.30 ms Slower |

🎯 Conclusion

The lol_html streaming processor behaves exactly as architecturally intended:

  1. Massive Win for Core Web Vitals: Because we no longer wait for the backend to buffer the entire HTML document before transmitting, the Fastly edge begins sending the <head> tag to the user's browser 15ms sooner on average (and up to 200ms sooner in worst-case staging scenarios). This means the browser can start downloading CSS/JS assets much faster.
  2. Acceptable CPU Overhead: Streaming chunk-by-chunk through the WASM boundary does consume more CPU time. On production hardware, this means the page finishes loading about 30ms later.

Exchanging 30ms of trailing transfer time for 15-20ms of upfront TTFB savings is a highly favorable trade for perceived performance. This branch is safe and recommended for merge.

@prk-Jr linked an issue Feb 23, 2026 that may be closed by this pull request
prk-Jr and others added 4 commits February 23, 2026 20:39
* Optimize Next.js RSC streaming with lazy accumulation

Implement lazy buffering that delays accumulation until RSC content is
detected, improving streaming from 0% to 28-37% for RSC pages while
maintaining 100% URL rewriting correctness.

- Add needs_accumulation() trait for conditional buffering
- Add 10MB memory limit for DoS protection
- Create integration test suite with real Next.js fixtures
- Add example Next.js app for testing

Performance: RSC pages stream 28-37% (theoretical max), non-RSC 96%.

* Preserve publisher fallback headers, centralize route classification, and always clean up live test temp files
@prk-Jr marked this pull request as ready for review February 25, 2026 16:32
@aram356 marked this pull request as draft February 26, 2026 16:48

Development

Successfully merging this pull request may close these issues.

Enable Streaming Chunks for responses to improve TTFB on TS

1 participant